Abstract

We study the problem of detecting priming events based on a time series
index and an evolving document stream. We define a priming event as an
event that triggers abnormal movements of the time series index, e.g., the
Iraq war with respect to the approval index of President Bush. Existing
solutions either focus on organizing coherent keywords from a document
stream into events or on identifying correlated movements between keyword
frequency trajectories and the time series index. In this paper, we tackle
the problem in two major steps. (1) We identify the elements that form a
priming event. Each such element is called an influential topic and
consists of a set of coherent keywords; we extract influential topics by
examining the correlation between keyword trajectories and the time series
index of interest at a global level. (2) We extract priming events by
detecting and organizing the bursty influential topics at a micro level.
We evaluate our algorithms on a real-world dataset, and the results
confirm that our method discovers priming events effectively.

1 Introduction 



With the growing volume of text published by news media and social
network websites, a large number of events emerge every day. Among these
events, only a few are priming and can make significant changes to the
world. In this paper, we measure the priming power of events based on
popular indices that people are usually concerned with.

Governor's Approval Index. Every activity of a governor can generate
reports in the news media and discussions on the web. However, only a few
of them will change people's attitude towards this governor and affect
his/her popularity, i.e., the approval index [12]. Since approval indices
measure the satisfaction of citizens and are crucial for the government,
people are more interested in the events that strongly affect the approval
index than in those with little impact.

Figure 1: Approval Index of President Obama

Financial Market Index. Many events related to a company or the stock
market happen, but only some of them change the valuation of the company
in investors' minds. Greed or fear drives investors to buy or sell the
stock and eventually changes the stock index. Investors are therefore
eager to know and monitor these events as they happen.

These real-world cases indicate a need to find the priming events that
drive a time series index of interest. Such priming events help people
understand what is going on in the world. However, discovering them poses
great challenges: 1) given the tremendous number of news articles over
time, how can we identify and organize the events related to a time series
index; 2) when several related events emerge at the same time, how can we
distinguish their impact and discover the priming ones; 3) as time goes
by, how can we track the life cycle of the priming events, as well as
their impact on the evolving time series index?

Some existing work has focused on discovering a set of topics/events from
the text corpus [TJ E] and tracking their life cycle [13]. But these
methods make no effort to guarantee that the events are influential and
related to the index that people are concerned with. Another stream of
work considers the relationship between keyword trajectories and the time
series of interest [9J [221 IB]. However, these works can only identify a
list of influential keywords for users and do not organize these words
into high-level, meaningful temporal patterns (i.e., events).

In this paper, we study the problem of detecting priming events from a
time series index and an evolving document stream. In Figure 1, we take
the weekly approval index of US President Obama from Jan 20, 2009 to Feb
28, 2010 as an example to illustrate the difficulty of this problem. In
Figure 1, the approval index (blue line) evolves and drops from 67% to 45%
over the 56 weeks. In particular, in July 2009 the index dropped from 54%
to 48%. A user who is interested in politics and wants to know what event
triggered this significant change may issue the query "President Obama" to
a search engine. But the result will only be a list of news articles on
the events that President Obama participated in during this period. In
Figure 1, we tag a small part of them on the index. As we can see, in July
2009 there are 5 events, including his attempt to reset the relationship
with Russia, help poor nations, visit Ghana, etc. With this information
alone, we cannot fulfill the user's need, since we cannot differentiate
the role that each event plays in changing the approval index in that time
period. This urges us to think about the following questions. What makes
an event priming? Does it contain some elements that attract the public
eye and change people's minds? Moreover, if such elements do exist, can we
find evidence of them in other time periods and use it to justify the
importance of the local event containing them?

In this paper, we call such elements influential topics and use them as
the basic units that form a priming event. Specifically, we identify the
influential topics at a global level by integrating information from both
a text stream and an index stream. Then, at a micro level, we detect
occurrences of these topics and organize them into several topic clusters
that represent the different events going on in each period. After that,
we connect similar topic clusters in consecutive time periods to form the
priming events. Finally, we rank these priming events by their influence
on the index.

The contributions of this paper are highlighted as follows. 

• To the best of our knowledge, we are the first to formulate the problem
of detecting priming events from a text stream and a time series index.
A priming event is an object which consists of three components: (1)
two timestamps denoting the beginning and the ending of the event; (2)
a sequence of local influential topic groups identified from the text
stream; (3) a score representing its influence on the index.

• We design an algorithm that first discovers the influential topics at a
global level and then drills down to local time periods to detect and
organize the priming events based on the influential topics.

• We evaluate the algorithm on a real-world dataset, and the results show
that our method can discover the priming events effectively.

The rest of the paper is organized as follows. Section 2 formulates the
problem and gives an overview of our approach. Section 3 discusses how we
detect bursty text features and measure the change of the time series
index. Section 4 describes the influential topic detection algorithm, and
Section 5 discusses how we use the influential topics to detect and
organize priming events. Section 6 presents our experimental results.
Section 7 reviews related work, and Section 8 concludes the paper.

2 Problem Formulation 

Let $\mathcal{D} = \{d_1, d_2, \ldots\}$ be a text corpus, where each
document $d_i$ is associated with a timestamp. Let $I$ be the index of
interest, which consists of $|W|$ consecutive and non-overlapping time
windows $W$. $\mathcal{D}$ is then partitioned into $|W|$ sets according
to the timestamps of the documents. Let $F$ be the set of all distinct
features in $\mathcal{D}$, where a feature $f \in F$ is a word in the text
corpus.

Given the index of interest $I$ and the text corpus $\mathcal{D}$, our
target is to detect the priming events that trigger the movement of the
index. As discussed in Section 1, the first step is to discover the
influential topics. A possible approach [6] [10] is to first retrieve all
the topics from the documents using the traditional approach in topic
detection and tracking (TDT), and then detect the influential topics by
comparing the strength of each topic with the change of the index over
time. We argue that this approach is inappropriate because the topic
detection is purely based on the occurrence of the words and ignores the
behavior of the index. Consider the feature worker. A typical TDT approach
would consider its co-occurrences with other features when deciding how to
form a topic. By enumerating all the possibilities, it will form a topic
with features such as union, because worker union frequently appears in
news documents. However, if we take the presidential approval index into
consideration, the most important event about worker related to the index
would be that President Bush increased the Federal Minimum Wage rate for
workers for the first time since 1997. This event helped him stop the
continuous drop of the approval index and keep it above 30%. Therefore, it
is more favorable to group worker with wage rather than with union. This
example urges us to consider how to leverage the information from the
index to help organize the features into influential topics. These
influential topics take into account not only the feature occurrence
information but also the changing behavior of the index. We formally
define such topics as follows.

Definition 1 (Influential Topics) An influential topic $\theta_i$ is
represented by a set of semantically coherent features
$F_{\theta_i} \subseteq F$ with a score indicating its influence on the
time series index $I$.

Based on the definition of influential topics, the next step is to
represent priming events using these topics. One simple and direct way is
to take each occurrence of a topic $\theta_i$ as a priming event. However,
this approach has one major problem. We often observe multiple topics in a
time window $w$ that are not independent but correlated and represent the
same on-going event. Our topic detection algorithm may not merge them into
a single topic because they co-occur only in that window and are separate
in other windows. For example, the topic {strike target} appears together
with the topic {force troop afghanistan} in 2001. But in 2003, when the
Iraq war started, it co-occurs with the topic {gulf war} instead.
Therefore, in order to capture such merge-and-separate behavior of topics,
we define the local topic cluster as follows.

Definition 2 (Local Topic Cluster) A local topic cluster $c_{w,i}$ in
window $w$ consists of a set of topics which occur in highly overlapping
documents and represent the same event.

Based on the definition of local topic cluster, we further define a priming 
event as follows. 

Definition 3 (Priming Event) A priming event $pe$ consists of three
components: (1) two timestamps denoting the beginning and the ending of
the event in the window $W_{pe}$; (2) a sequence of local topic clusters
$c_{w,i}$, $w \in W_{pe}$; (3) a score $Score(pe)$ representing its
priming effect.

Our algorithm is designed to detect events with a high priming event
score and has four major steps: (1) data transformation, (2) global
influential topic detection, (3) local topic cluster path detection, and
(4) event extraction by grouping similar topic cluster paths. Details are
given in the following sections.

3 Data Transformation 

In this section, we present how we transform and normalize the features
$F$ and the index $I$.

3.1 Bursty Period Detection 

We first discuss how to select features in window $w$ to represent
on-going events. Given the whole feature set $F$, we find that the
emergence of an important event is usually accompanied by a "burst of
words": some words suddenly appear frequently when an event emerges, and
their frequencies drop when the event fades away. Hence, by monitoring the
temporal changes of the word distributions in the news articles, we can
determine whether any new event is appearing. Specifically, if a feature
suddenly appears frequently in a window $w$, we say that an important
event emerges.

To determine in which windows the feature $f$ "suddenly appears
frequently", we compute a bursty probability for each window $w \in W$.
Let $P(f, w; p_e)$ be the probability that the feature $f$ is bursty in
window $w$ according to $p_e$, where $p_e$ is the probability that $f$
would appear in any arbitrary window given that it is not bursty. The
intuition is that if the frequency of $f$ in $w$ is significantly higher
than the expected probability $p_e$, then $w$ is regarded as a bursty
period [221 H]. We compute $P(f, w; p_e)$ according to the cumulative
distribution function of the binomial distribution:

\[ P(f, w; p_e) = \sum_{l=0}^{n_{f,w}} P(l; N_w, p_e), \qquad
   P(l; N_w, p_e) = \binom{N_w}{l} p_e^{l} (1 - p_e)^{N_w - l}, \tag{1} \]

where $N_w$ is the total number of words appearing in window $w$, and
$n_{f,w}$ is the frequency of $f$ in $w$. $P(f, w; p_e)$ is a cumulative
distribution function of the binomial distribution, and $P(l; N_w, p_e)$
is the corresponding probability mass function. Therefore, $P(f, w; p_e)$
represents the bursty rate of $f$ in window $w$. The bursty periods of $f$
are the windows in which $P(f, w; p_e)$ is larger than a specified
threshold. With this transformation, we obtain the bursty probability time
series for each feature $f \in F$ as
$p(f) = \{p(f, 1), p(f, 2), \ldots, p(f, |W|)\}$, and the bursty windows
of $f$ are denoted as $B_f$.
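
As a minimal sketch of the bursty probability in Eq. (1), the following Python code evaluates the binomial cumulative distribution at the observed frequency and thresholds it to obtain the bursty windows. The function names, the example counts, and the threshold value 0.95 are illustrative assumptions and are not taken from the paper.

```python
from math import comb

def binomial_pmf(l: int, n_words: int, p_e: float) -> float:
    """P(l; N_w, p_e): probability of exactly l occurrences among N_w words."""
    return comb(n_words, l) * (p_e ** l) * ((1.0 - p_e) ** (n_words - l))

def bursty_probability(n_fw: int, n_words: int, p_e: float) -> float:
    """P(f, w; p_e): binomial CDF evaluated at the observed frequency n_fw."""
    return sum(binomial_pmf(l, n_words, p_e) for l in range(n_fw + 1))

def bursty_windows(freqs, totals, p_e, threshold=0.95):
    """Return the series p(f) and the indices of the bursty windows B_f.

    The threshold of 0.95 is an assumed value chosen for illustration.
    """
    series = [bursty_probability(n, N, p_e) for n, N in zip(freqs, totals)]
    bursty = [w for w, p in enumerate(series) if p > threshold]
    return series, bursty

# Hypothetical feature observed in four windows of 100 words each,
# with a background (non-bursty) rate of 2%.
series, bursty = bursty_windows([1, 2, 12, 1], [100] * 4, p_e=0.02)
print(bursty)
```

This direct summation is only suitable for small window sizes; for realistic word counts a numerically stable binomial CDF routine would be preferable.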

3.2 Volatility Transformation and Discretization 

We now discuss how to monitor the change of the index $I$ to reflect the
effect of priming events. In this paper, instead of looking solely at the
rise/fall of $I$, we study the change of volatility of the time series
[TjJJ IB]. Volatility is the standard deviation of the continuously
compounded returns of a time series within a specific time horizon and is
used to measure how widely the time series values are dispersed from the
average:

\[ \mathrm{Volatility} = \sqrt{\sum_{i} [R_i - E(R_i)]^2 P_i}, \tag{2} \]

where $R_i$ is a possible rate of return, $E(R_i)$ is the expected rate of
return, and $P_i$ is the probability of $R_i$. We can thus transform the
index time series $I$ into the volatility time series
$VI = \{VI_1, \ldots, VI_{|W|}\}$.

Given the volatility index $VI$, we observe abnormal behavior in certain
time windows. For example, around the 9/11 event there is a volatility
burst in President Bush's approval index. Such phenomena would heavily
bias the results towards the events happening in these bursty volatility
windows. In order to avoid such bias, we further transform the volatility
index $VI$ to obtain a discrete representation with equal probability
[14]. According to our experiments, the volatility index fits a logistic
distribution. Therefore, we can produce equal-sized areas under the
logistic curve [16] and transform the index volatility time series into a
probabilistic time series $PVI = \{PVI_1, \ldots, PVI_{|W|}\}$.
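
The transformation can be sketched in Python as follows. This is only an illustration under several assumptions that the text does not fix: returns are log returns between consecutive index values, all returns in a horizon are equally probable (so Eq. (2) reduces to the population standard deviation), the logistic distribution is fitted by the method of moments, and the probabilistic value of a window is taken to be the logistic CDF of its volatility. All function names are hypothetical.

```python
import math
import statistics

def log_returns(index):
    """Continuously compounded returns between consecutive index values."""
    return [math.log(b / a) for a, b in zip(index, index[1:])]

def rolling_volatility(returns, horizon):
    """Population standard deviation of the returns in each trailing horizon."""
    return [statistics.pstdev(returns[i - horizon:i])
            for i in range(horizon, len(returns) + 1)]

def logistic_cdf(x, mu, s):
    """CDF of the logistic distribution with location mu and scale s."""
    return 1.0 / (1.0 + math.exp(-(x - mu) / s))

def probabilistic_volatility(vol):
    """Map each volatility value to a probability in (0, 1).

    Assumes the volatility values are not all identical (nonzero spread).
    """
    mu = statistics.mean(vol)
    # Method-of-moments scale: the logistic variance equals (s * pi)^2 / 3.
    s = statistics.pstdev(vol) * math.sqrt(3) / math.pi
    return [logistic_cdf(v, mu, s) for v in vol]

# Hypothetical approval ratings with one sharp drop.
vol = rolling_volatility(log_returns([60, 61, 59, 62, 50, 51, 52]), horizon=3)
pvi = probabilistic_volatility(vol)
print([round(p, 3) for p in pvi])
```

Because the logistic CDF is monotone, the ordering of the windows by volatility is preserved while extreme bursts are compressed into the unit interval; cutting the unit interval into equal-width bins would then give symbols of equal probability under the fitted curve.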

4 Global Influential Topic Detection 

Given the feature set $F$ and the probabilistic volatility index $PVI$,
our task here is to identify a set of influential topics
$\{\theta_1, \ldots, \theta_k\}$, where each topic $\theta_k$ is formed by
a set of keywords
$F_{\theta_k} = \{f_{\theta_k,1}, \ldots, f_{\theta_k,|\theta_k|}\}$. The
problem can be solved by finding the optimal $\theta_k$ such that the
probability of the influential bursty features being grouped together is
maximal for the text stream $\mathcal{D}$ and $PVI$. Below, we formally
define the probability of $\theta_k$:

Definition 4 (Influential Topic Probability) The probability of an
influential topic $\theta_k$ is given by

\[ P(\theta_k \mid \mathcal{D}, PVI) =
   \frac{P(\mathcal{D}, PVI \mid \theta_k)\, P(\theta_k)}
        {P(\mathcal{D}, PVI)}. \tag{3} \]

Since $P(\mathcal{D}, PVI)$ is independent of $\theta_k$, we only consider
the numerator of Eq. (3). We use the topic $\theta_k$ to account for the
dependency between $\mathcal{D}$ and $PVI$. Therefore, given $\theta_k$,
$\mathcal{D}$ and $PVI$ are independent. Our objective function then
becomes:

\[ \max_{\theta_k} P(PVI, \mathcal{D} \mid \theta_k)\, P(\theta_k) =
   \max_{\theta_k} P(PVI \mid \theta_k)\, P(\mathcal{D} \mid \theta_k)\,
   P(\theta_k). \tag{4} \]

Some observations can be made on Eq. (4). The second component,
$P(\mathcal{D} \mid \theta_k)$, represents the probability that the
influential topic generates the documents. Intuitively, we expect the
document overlap of the features from the same topic to be high. The third
component, $P(\theta_k)$, represents the probability of the features being
grouped together: two features should be grouped together if they usually
co-occur temporally. These two components therefore require the features
of $\theta_k$ to be coherent at both the document level and the temporal
level. In general, if more features are grouped together, the values of
the second and the third components decrease. The first component,
$P(PVI \mid \theta_k)$, represents the probability that the influential
topic triggers the volatility of the time series index. If the features in
the group cover more windows with high volatility probability, the value
of the first component is higher. This makes the algorithm look for the
features with high potential impact on the index. Below, we show how we
estimate these three components.

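First, we define the document similarity of two features using the Jaccard
coefficient [3]:

\[ sim(f_i, f_j) = \frac{|D_{f_i} \cap D_{f_j}|}{|D_{f_i} \cup D_{f_j}|},
   \tag{5} \]

where $D_{f_i}$ is the set of documents containing feature $f_i$. Then
$P(\mathcal{D} \mid \theta_k)$ can be estimated as the average similarity
over the feature pairs of the topic:

\[ P(\mathcal{D} \mid \theta_k) =
   \frac{1}{|F_{\theta_k}|\,(|F_{\theta_k}| - 1)}
   \sum_{f_i, f_j \in F_{\theta_k},\, i \neq j} sim(f_i, f_j). \tag{6} \]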
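
A short Python sketch of the document-level coherence may help: it computes the Jaccard coefficient of Eq. (5) on document-id sets and averages it over the feature pairs of a topic. The helper names, the toy document ids, and the convention of returning 0 for a topic with fewer than two features are assumptions made for illustration.

```python
from itertools import combinations

def jaccard(docs_a: set, docs_b: set) -> float:
    """Jaccard coefficient of two document-id sets (Eq. (5))."""
    union = docs_a | docs_b
    return len(docs_a & docs_b) / len(union) if union else 0.0

def topic_document_coherence(doc_sets: dict, features: list) -> float:
    """Average pairwise Jaccard similarity of the features in a topic.

    Averaging over unordered pairs equals averaging over ordered pairs
    because the similarity is symmetric.
    """
    pairs = list(combinations(features, 2))
    if not pairs:  # assumed convention for single-feature topics
        return 0.0
    return sum(jaccard(doc_sets[a], doc_sets[b]) for a, b in pairs) / len(pairs)

# Hypothetical document ids containing each feature.
doc_sets = {
    "worker": {1, 2, 3, 4},
    "wage": {2, 3, 4, 5},
    "union": {1, 9},
}
print(jaccard(doc_sets["worker"], doc_sets["wage"]))  # 3 shared of 5 -> 0.6
print(topic_document_coherence(doc_sets, ["worker", "wage", "union"]))
```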
Second, in order to compute $P(\theta_k)$, we estimate the temporal
similarity of two features by comparing their co-occurrence over the whole
set of windows $W$:

\[ P(f_i, f_j) = \frac{p(f_i) \cdot p(f_j)}{\|p(f_i)\|\, \|p(f_j)\|},
   \tag{7} \]

where $p(f_i) = \{p(f_i, 1), p(f_i, 2), \ldots, p(f_i, |W|)\}$ is the
bursty probability time series of $f_i$ computed in Section 3.1. Then the
probability of a set of features belonging to $\theta_k$ can be estimated
by the average similarity over each pair of features:

\[ P(\theta_k) = \frac{1}{|F_{\theta_k}|\,(|F_{\theta_k}| - 1)}
   \sum_{f_i, f_j \in F_{\theta_k},\, i \neq j} P(f_i, f_j). \tag{8} \]

Third, in order to estimate $P(PVI \mid \theta_k)$, we define the
influence probability of a feature, $P(PVI \mid f)$, as:

\[ P(PVI \mid f) = \frac{\sum_{w \in B_f} PVI_w}{\sum_{w \in W} PVI_w}.
   \tag{9} \]

Since the denominator of Eq. (9) is the same for all features, we take
only the numerator for computation. $P(PVI \mid \theta_k)$ can then be
estimated as

\[ P(PVI \mid \theta_k) = \sum_{f \in F_{\theta_k}} P(PVI \mid f).
   \tag{10} \]

Finally, the topics can be extracted using a greedy algorithm that
maximizes Eq. (4) for each topic, in a similar way as in [6].

However, Eq. (4) is different from the objective function defined in [6],
since we extract topics with respect to a time series index of interest
$I$ rather than purely based on text documents. Consider the worker
example again. The document similarity and the temporal similarity of
worker and union are 0.31 and 0.25, while those of worker and wage are
0.35 and 0.1. If we do not consider the index $I$, i.e., we set
$P(PVI \mid \theta_k) = 1$, then by Eq. (4),
$P(\text{worker}, \text{union} \mid \mathcal{D}, PVI) = 0.31 \times 0.25 =
0.0775$ and $P(\text{worker}, \text{wage} \mid \mathcal{D}, PVI) = 0.35
\times 0.1 = 0.035$. As a result, the algorithm would combine worker and
union. However, since $P(PVI \mid \text{union}) = 30$ and
$P(PVI \mid \text{wage}) = 74$, by considering the feature influence on
the index $I$, we have
$P(\text{worker}, \text{union} \mid \mathcal{D}, PVI) = 0.0775 \times 30 =
2.325$ and $P(\text{worker}, \text{wage} \mid \mathcal{D}, PVI) = 0.035
\times 74 = 2.59$. In this way, the algorithm will instead group worker
and wage together to form an influential topic with respect to $I$.
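
The arithmetic of the worker example can be checked with a few lines of Python. The numbers are the ones quoted in the text; the function names and the dictionary layout are hypothetical and serve only to show how including the influence term flips the choice of partner feature.

```python
def topic_score(doc_sim: float, temporal_sim: float, influence: float = 1.0) -> float:
    """Product of the three components of the objective for one grouping."""
    return influence * doc_sim * temporal_sim

# Values quoted in the worker example; the layout is an assumption.
CANDIDATES = {
    "union": {"doc_sim": 0.31, "temporal_sim": 0.25, "influence": 30.0},
    "wage": {"doc_sim": 0.35, "temporal_sim": 0.10, "influence": 74.0},
}

def best_partner(candidates: dict, use_index: bool) -> str:
    """Pick the feature that maximizes the score when grouped with worker."""
    def score(name: str) -> float:
        c = candidates[name]
        influence = c["influence"] if use_index else 1.0
        return topic_score(c["doc_sim"], c["temporal_sim"], influence)
    return max(candidates, key=score)

print(best_partner(CANDIDATES, use_index=False))  # union (0.0775 vs 0.035)
print(best_partner(CANDIDATES, use_index=True))   # wage (2.59 vs 2.325)
```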

As shown above, since influential topics carry the volatility index
information, they benefit priming event detection. In the above example,
if we detect a new event containing the common topic {worker, union}, the
event may be trivial, since the topic has had a low influence probability
in the past. But if we detect an event with {worker, wage} instead, this
event has a higher probability of being a priming event. This is because
the high influence probability of the topic indicates that events with
this topic attracted public attention and changed people's minds in the
past.

After extracting each $\theta_k$, the bursty rate of $\theta_k$ in a
window $w$ can be computed as below:

\[ P(\theta_k, w) = \frac{1}{|F_{\theta_k}|}
   \sum_{f \in F_{\theta_k}} p(f, w). \tag{11} \]

The bursty period of $\theta_k$ is determined in a similar way to the
bursty period of features in Section 3.1. We use $\theta_w$ to denote all
the bursty topics in window $w$.



5 Micro Priming Event Detection 

A priming event $pe$ consists of three components: (1) two timestamps
denoting the beginning and the ending of the event in the window $W_{pe}$;
(2) a sequence of local correlated topic clusters $c_w$, $w \in W_{pe}$;
(3) a score representing its influence on the index, which is defined by
considering the following three aspects:

1. High event intensity $B_{pe}$: The event contains highly bursty topics.
The intensity at each window $w$ is estimated as follows:

\[ B_{pe,w} = \frac{1}{|c_w|} \sum_{\theta_k \in c_w} P(\theta_k, w).
   \tag{12} \]

The intensity over the whole period is measured by $\|B_{pe}\|$, where
$\|\cdot\|$ denotes the 2-norm of the vector.

2. High index volatility: The index must have high volatility during the
bursty period of this event, which is measured by $\|PVI_{pe}\|$.

3. High index and event co-movement rate: The index and the event
intensity time series should be highly correlated, which is measured by
their linear correlation $\mathrm{Corref}(B_{pe}, PVI_{pe})$.

We combine these three measures and define the priming event score as
below:

\[ Score(pe) = \|B_{pe}\| \cdot \|PVI_{pe}\| \cdot
   \mathrm{Corref}(B_{pe}, PVI_{pe}). \tag{13} \]

In this section, we propose a two-phase algorithm to detect events with a
high priming event score. We first look for potentially important topic
clusters at each local time period by extracting and grouping bursty
topics (high event intensity). We then take the topic clusters at the
period with the most significant index volatility as seeds and probe the
whole priming event path (high index volatility and high index and event
co-movement rate). Below, we discuss the two phases of the algorithm in
detail.
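
The priming event score of Eq. (13) can be sketched in Python as the product of two vector norms and a linear correlation. The sketch assumes that the linear correlation is the Pearson coefficient and that a constant series yields a correlation of 0; the function names and the example series are hypothetical.

```python
import math

def l2_norm(xs):
    """2-norm of a vector."""
    return math.sqrt(sum(x * x for x in xs))

def pearson(xs, ys):
    """Pearson linear correlation; returns 0.0 for a constant series (assumed)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0.0 or sy == 0.0:
        return 0.0
    return cov / (sx * sy)

def priming_event_score(intensity, pvi):
    """Event intensity norm times volatility norm times their correlation."""
    return l2_norm(intensity) * l2_norm(pvi) * pearson(intensity, pvi)

# Hypothetical event spanning three windows.
intensity = [0.2, 0.9, 0.4]
print(priming_event_score(intensity, [0.3, 0.8, 0.5]))  # co-moving: positive
print(priming_event_score(intensity, [0.8, 0.3, 0.5]))  # counter-moving: negative
```

An event whose intensity rises and falls together with the volatility index scores high, while an equally intense event that moves against the index is penalized by the negative correlation.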

5.1 Local Topic Cluster Detection 

As discussed in Section 2, we usually observe multiple correlated bursty
topics in a time window $w$ that represent the same event. Therefore, we
first group the correlated topics into topic clusters at each time window
$w$.

Intuitively, if two topics $\theta_i$ and $\theta_j$ belong to the same
event, reporters usually discuss them in the same news articles, and they
have a high degree of document overlap. We first define the document
frequency vector for a topic $\theta_k$ at a window $w$.

Let $\theta_w$ be the set of bursty topics in window $w$ and $D_w$ be the
set of documents in window $w$. We define
$D_{f,w} = \{d_{f,w,1}, d_{f,w,2}, \ldots, d_{f,w,|D_w|}\}$ to be the term
frequency vector for feature $f$ in the documents in window $w$.

Then, the document frequency vector for a topic $\theta_k$ at a window $w$
is computed as the average of the term frequency vectors of its features:

\[ D_{\theta_k,w} = \frac{1}{|F_{\theta_k}|}
   \sum_{f \in F_{\theta_k}} D_{f,w}. \tag{14} \]

Then the similarity between two topics $\theta_i$ and $\theta_j$ at window
$w$ can be estimated by the cosine similarity:

\[ sim(\theta_i, \theta_j, w) = \cos(D_{\theta_i,w}, D_{\theta_j,w}).
   \tag{15} \]
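
The two steps above, averaging term frequency vectors into a topic vector and comparing topics by cosine similarity, can be sketched in Python as follows. The term frequencies are invented for a window with four documents, and the function names and the zero-vector convention are assumptions.

```python
import math

def topic_vector(term_freqs: dict, features: list) -> list:
    """Average the per-document term frequency vectors of a topic's features."""
    n_docs = len(term_freqs[features[0]])
    return [sum(term_freqs[f][d] for f in features) / len(features)
            for d in range(n_docs)]

def cosine(u, v) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros (assumed)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical term frequencies of five features in four documents of window w.
term_freqs = {
    "strike": [2, 0, 1, 0],
    "target": [1, 0, 2, 0],
    "gulf":   [1, 0, 1, 0],
    "war":    [2, 0, 3, 0],
    "tax":    [0, 3, 0, 1],
}
strike_target = topic_vector(term_freqs, ["strike", "target"])
gulf_war = topic_vector(term_freqs, ["gulf", "war"])
tax = topic_vector(term_freqs, ["tax"])
print(cosine(strike_target, gulf_war))  # close to 1: same documents
print(cosine(strike_target, tax))       # 0.0: disjoint documents
```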

Figure 2: Topic Clusters Probe

In order to cluster the set of topics $\theta_w$ into several topic
clusters, we use the K-Means clustering algorithm [3]. We determine the
optimal number of clusters by examining the quality of the clustering
result under different cluster numbers $k$. The quality of the clustering
is measured by the ratio of the weighted average inter-cluster similarity
to the weighted average intra-cluster similarity [17]. After clustering,
for a window $w$, we obtain a set of topic clusters
$C_w = \{c_{w,1}, c_{w,2}, \ldots\}$.

5.2 Composite Topic Cluster Path Detection 

The detected topic cluster $c_{w,i}$ at each window represents a snapshot
of an event. The remaining question is how we can utilize these topic
clusters and connect them into a topic cluster path that represents a
complete priming event.

Intuitively, a time period with high index volatility has a better chance
of containing the right priming event. Therefore, we sort the time periods
according to the index volatility probability in Section 3.2 and start
from the topic clusters in the time period with the highest volatility
probability. Figure 2 illustrates the idea, where the bottom part is the
volatility index and the upper part shows the detected topic clusters at
three consecutive windows. As we can see, window $w$ has the highest
volatility, and therefore the two topic clusters in $w$ are taken as seed
topic clusters. We then start to construct the priming event from these
two topic clusters. Specifically, we probe forward/backward and associate
similar and appropriate topic clusters into a topic cluster path $P$.

Here, we measure the similarity between two topic clusters in two
consecutive windows, $c_{w,i}$ and $c_{w+1,j}$, by looking at the
intersection of their influential topics. More specifically, the
similarity of two topic clusters is computed by adding the topic
probability $P(\theta_k \mid \mathcal{D}, PVI)$ as a weight to the Jaccard
coefficient:

\[ sim(c_{w,i}, c_{w+1,j}) =
   \frac{\sum_{\theta_k \in c_{w,i} \cap c_{w+1,j}}
         P(\theta_k \mid \mathcal{D}, PVI)}
        {\sum_{\theta_k \in c_{w,i} \cup c_{w+1,j}}
         P(\theta_k \mid \mathcal{D}, PVI)}. \tag{16} \]

Eq. (16) assigns a higher similarity score to two topic clusters whose
overlapping topics have higher topic probability, i.e., the topics are
more coherent and influential.
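
A minimal Python sketch of this weighted Jaccard similarity is given below. Clusters are represented as sets of topic labels, and the topic probabilities are invented for illustration; the function name and the zero-denominator convention are assumptions.

```python
def cluster_similarity(cluster_a: set, cluster_b: set, topic_prob: dict) -> float:
    """Jaccard coefficient of two topic clusters, weighted by topic probability."""
    denom = sum(topic_prob[t] for t in cluster_a | cluster_b)
    if denom == 0.0:  # assumed convention for empty or zero-weight clusters
        return 0.0
    return sum(topic_prob[t] for t in cluster_a & cluster_b) / denom

# Hypothetical influential topic probabilities.
topic_prob = {"strike target": 0.5, "gulf war": 0.3, "oil price": 0.2}

# Two cluster pairs that each share exactly one of three topics, so the plain
# Jaccard coefficient is 1/3 in both cases.
shared_strong = cluster_similarity({"strike target", "gulf war"},
                                   {"strike target", "oil price"}, topic_prob)
shared_weak = cluster_similarity({"gulf war", "oil price"},
                                 {"strike target", "oil price"}, topic_prob)
print(shared_strong, shared_weak)
```

Sharing the high-probability topic gives a similarity of about 0.5, whereas sharing only the low-probability topic gives about 0.2, which is the behavior described for the weighted coefficient.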

However, even if two clusters have a high similarity score, we still
cannot link them directly, since the topic cluster path with the new
cluster may have worse quality than the original one. Here, we measure the
quality of a topic cluster path using the same measure as for the priming
event, which is defined in Eq. (13). For example, in Figure 2, the link
between $c_{w,1}$ and $c_{w+1,1}$ means that they have a high similarity
with each other. However, if we extend the path to $w+1$ by connecting
$c_{w,1}$ with $c_{w+1,1}$, the event intensity is still very high at
window $w+1$ but the index volatility has decreased. This results in a
lower correlation between the topic cluster path and the volatility index
and eventually reduces the priming event score. Therefore, we only connect
two topic clusters if 1) they have a similarity score
$sim(c_{w,i}, c_{w+1,j})$ higher than a predefined threshold $\alpha$, and
2) the topic cluster path score improves by integrating the new topic
cluster.

After connecting appropriate topic clusters, the algorithm forms a
directed acyclic graph (DAG) of topic clusters between windows, and we
formally define a path $P$ in this graph as follows.

Definition 5 (Topic Cluster Path) A topic cluster path $P$ of length $l$
in a topic cluster graph is a sequence of $l$ clusters
$c_{w_1,i_1} \to c_{w_2,i_2} \to \cdots \to c_{w_l,i_l}$ such that
$\{w_1, \ldots, w_l\}$ are $l$ consecutive windows and there is an edge
between every two consecutive clusters in the graph.

In addition, topic cluster paths may overlap with each other, in which
case they may express different aspects of the same priming event. For
example, in the gulf war event, one topic cluster path may show the
progress of the battle in Iraq, while another path may record the actions
of the US's allies. Therefore, we measure the similarity between two
overlapping paths as follows:

\[ sim(P_i, P_j) = \frac{|P_i \cap P_j|}{|P_i \cup P_j|}. \tag{17} \]



Algorithm 1 DiscoverPrimingEvents

Input: News document stream $\mathcal{D}$ and index $I$
Output: priming events $PE$
 1: for every $w$ do
 2:   retrieve the bursty topics at $w$, $\theta_w$
 3:   $C_w \leftarrow$ KMeans($\theta_w$)
 4:   $Sign_{w,i}$ = true
 5: end for
 6: for every $w$ in descending order of $PVI_w$ do
 7:   $IP = \{\}$
 8:   for each $c_{w,i} \in C_w$ and $Sign_{w,i}$ do
 9:     generate a new path $P_k$
10:     add $P_k$ to inverted path list $IP_{c_{w,i}}$
11:     $Sign_{w,i}$ = false
12:   end for
13:   $P$ = ProbeEventPath($C$, $P$, $IP$, $w$, $Sign$)
14: end for
15: compute $sim(P_i, P_j)$ for every $P_i, P_j \in P$ according to Eq. (17)
16: return $PE$ = Cluster($sim$)



If the similarity between two paths $sim(P_i, P_j)$ is high, then we group
these two paths to form a priming event. Here we employ agglomerative
hierarchical clustering [3] to conduct the grouping and extract events.

Algorithm 1 describes the whole process of detecting priming events. First
of all, for each time window $w$, we retrieve the bursty influential
topics $\theta_w$ in Line 2. Then Line 3 groups the topics into topic
clusters $C_w$ as described in Section 5.1. Here we keep a flag $Sign$ to
indicate whether a topic cluster has already been used to construct a
priming event. Lines 6-14 are the key process that constructs the priming
events by connecting the topic clusters in consecutive windows. As
discussed before, it starts from the window with the highest index
volatility. In Lines 8-12, for each seed window $w$, we check the unused
topic clusters and generate a new path for each of them. Here, we also
maintain the inverted path list $IP_{c_{w,i}}$ for $c_{w,i}$. With this
preparation, in Line 13 we probe the event path, which is described in
detail in Algorithm 2. After we discover all the paths, we compute the
topic cluster path similarity by Eq. (17) in Line 15 and group the paths
using the hierarchical clustering algorithm in Line 16.
Algorithm 2 shows how to probe the whole event path starting from a seed
window $sw$. The algorithm proceeds both forward and backward. Let us take
the forward direction as an example, where we try to see whether we can
extend a path to window $w+1$. For each cluster $c_{w+1,j}$ in $C_{w+1}$,
we compare $c_{w+1,j}$ with every topic cluster that has an entry in $IP$
(i.e., that ends a path to be extended) by computing the cluster
similarity according to Eq. (16) in Line 5. If their similarity is higher
than $\alpha$, then in Lines 7-15 we check all the paths in the inverted
path list of $c_{w,i}$ and decide whether to extend each path by checking
whether its priming event score would increase after integrating
$c_{w+1,j}$. Besides, we also maintain the new inverted path list of
$c_{w+1,j}$, $NIP_{c_{w+1,j}}$. Finally, in Lines 19-20, we use the new
inverted list $NIP$ to replace the previous one $IP$ and move the window
forward. The same applies to the backward topic cluster path probing
operation. The algorithm returns the updated list of topic cluster paths
$P$.

6 Experimental Results 

In this section, we evaluate our algorithm on real-world datasets from the
political and finance domains.

We archive the news articles from the ProQuest database. For political
news, we take "President Bush" and "President Obama" as the query keywords
and extract 15,542 and 1,643 news articles, respectively, from Jan. 1,
2001 to June 1, 2010. For the financial news, we want to study the priming
events during the financial crisis period, so we take "Finance" as the
query keyword and extract 9,416 news articles from June 1, 2007 to June
30, 2010.

For the preprocessing of these news articles, all features are stemmed
using the Porter stemmer. Features are words that appear in the news
articles, with the exception of digits, web page addresses, email
addresses and stop words. Features that appear in more than 80% of the
total news articles in a day are categorized as stop words. Features that
appear in less than 5% of the total news articles in a day are categorized
as noisy features. Both the stop words and the noisy features are removed.
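
The document-frequency filter described above can be sketched in Python as follows. This is an illustration only: it assumes that the tokens are already stemmed and stripped of digits and addresses, and it applies the two thresholds to a single day's batch of articles. The function name and the toy data are hypothetical.

```python
def filter_features(day_docs, max_ratio=0.80, min_ratio=0.05):
    """Keep features whose share of the day's articles lies within the bounds.

    day_docs is a list of token lists, one per article published on that day.
    Features above max_ratio are treated as stop words, and features below
    min_ratio are treated as noisy features.
    """
    n_docs = len(day_docs)
    doc_freq = {}
    for tokens in day_docs:
        for tok in set(tokens):  # count each article at most once per feature
            doc_freq[tok] = doc_freq.get(tok, 0) + 1
    return {tok for tok, count in doc_freq.items()
            if min_ratio <= count / n_docs <= max_ratio}

# Hypothetical day with 40 articles: "said" occurs in all of them,
# "wage" in 10 of them, and "zzz" in a single one.
day_docs = [["said"] + (["wage"] if i < 10 else []) + (["zzz"] if i == 0 else [])
            for i in range(40)]
print(filter_features(day_docs))  # {'wage'}
```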

For the time series index, we archive the President Approval rating index
from the Gallup Poll. The poll is taken approximately every 10 days.
Rank  Bush                 Obama                 Finance
1     bin laden            oil Mexico            Moody Triple mature
2     north Korean         civil movement equal  Merrill lynch
3     Israel Palestinian   immigration illegal   lehman brother
4     immigration illegal  police arrest         barrack obama primary
5     destruct mass        small business lend   Germany

Table 1: Influential Topics

For President Obama, we take a weekly average rating based on the Gallup
daily tracking. For the financial index, we use a popular U.S. market index,
the S&P 500 index, and also take a weekly sample based on its daily values.
We further identify the bursty probability time series and the volatility
time series according to the methods introduced in Section 3.1 and
Section 3.2.
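
As an illustration of the weekly aggregation step, the sketch below averages a daily series per ISO calendar week. The grouping by ISO week and the name `weekly_average` are assumptions made for the example; the text does not state how week boundaries were chosen.

```python
from collections import defaultdict
from datetime import date
from statistics import mean
from typing import Iterable, List, Tuple


def weekly_average(
    daily: Iterable[Tuple[date, float]]
) -> List[Tuple[Tuple[int, int], float]]:
    """Average a daily series per (ISO year, ISO week), in time order."""
    buckets = defaultdict(list)
    for day, value in daily:
        iso = day.isocalendar()
        buckets[(iso[0], iso[1])].append(value)
    return [(week, mean(values)) for week, values in sorted(buckets.items())]


# Made-up ratings for two consecutive weeks (values are illustrative only).
ratings = [
    (date(2009, 7, 6), 60), (date(2009, 7, 8), 58),
    (date(2009, 7, 13), 56), (date(2009, 7, 15), 54),
]
weekly = weekly_average(ratings)
```

A weekly sample, as used for the S&P 500 index, could be obtained in the same way by keeping one value per bucket (for instance the last one) instead of the mean.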

After this preprocessing, we have three real datasets.

• Bush. It contains 1186 bursty feature probability streams and 1
volatility time series, all of length 281, covering the 8 years of Bush's
administration.

• Obama. It contains 1197 bursty feature probability streams and 1
volatility time series, all of length 56, covering the 13 months of
Obama's administration.

• Tsunami. It contains 961 bursty feature probability streams and 1
volatility time series, all of length 156, covering the 3 years of the
financial crisis.

We implemented our framework using C# and performed the experiments 
on a PC with a Pentium IV 3.4GHz CPU and 3GB RAM. 

6.1 Identifying Influential Topics 

With the algorithm in Section 4, we identify 385 and 207 influential topics
from Bush and Obama, respectively, for the political domain. For the finance
domain, we identify 100 influential topics from Tsunami. Table 1 gives the
top 5 influential topics of each dataset, ranked by the influential topic
probability in Eq. (3). As shown in the second column of the table, {bin
laden} and {north Korea} are 2-gram names that were regarded as the largest
threats to Bush's administration from 2001 to 2008. Each of the other three
topics consists of a set of features which are not 2-gram names but coherent
and influential keywords in the document stream. The third column of the
table shows the influential topics from Obama. Compared with Bush, where 4
out of 5 topics are about international affairs, the influential topics of
Obama show a significant shift. In particular, only the first topic of Obama
is about an international issue, i.e., the oil slick in the Gulf of Mexico.
The others are all about how he deals with domestic affairs. The fourth
column shows the influential topics we detect from Tsunami. The first topic
{Moody Triple} shows a common trigger of the market crash: when Moody's
changed the Triple rating of big and mature financial institutions,
investors panicked and closed their positions quickly. The second and third
topics are two big investment banks: Lehman Brothers went bankrupt, while
Merrill Lynch was acquired by Bank of America after heavy losses in the
financial crisis. Their fates were watched by all the investors in the
market and therefore became influential topics. The fourth and fifth topics
are two parties whose actions are critical to the direction of the market.
The first is President Obama, who leads the policy of the United States,
and the second is Germany, which influences the decision on the bailout plan
of Europe.

As discussed before, these influential topics detected at a global level
provide evidence for detecting priming events at a micro level.

6.2 Priming Events in Politics 

With the influential topics, we can identify the priming events by the
algorithm in Section 5. Figure 3 shows the approval index of President Bush
and the top 10 priming events automatically detected over his eight years of
administration. As shown in the upper part of Figure 3, the approval index
starts at 57% and jumps to 90% in Sep 2001. It then drops quickly until it
rebounds to 70% in 2003. After that, it continues dropping, with some small
rebounds in between, and eventually reaches 34% in 2008. In the lower part,
the blue line shows the volatility index according to Section 3.2 and the
colored waves represent the ranked priming events (we normalize the values of
these two time series and plot them in the same graph). The rank is based on
the score given by Eq. (13), and a wave with a deeper color represents a
priming event with a higher rank. The magnitude of the wave represents the
intensity of the event according to Eq. (12). From the figure, we have two
obvious observations: 1) the value of the volatility index increases when a
significant trend change happens in the approval index; 2) during the
periods when the volatility index increases significantly, we detect a burst
of priming events. For example, after September 2001, we observe a
significant increase of the volatility index reflecting the big jump of the
approval index. At the same time, we detect the top 1 priming event, which is
about the 911 terrorist attack. Figure 4 further shows its structure, which
contains a composite topic cluster path of length 9, starting from window 22
and ending at window 30. In each window, the path contains a topic cluster
with a set of topics lying in a highly overlapped document set. For example,
in window 22, the cluster consists of 5 bursty topics including the
influential topics {bin laden} and {Sept Terrorist Attack}, which indicate
the start of the 911 event. After that, in window 23, we observe another
cluster containing {bin laden}, and we connect it with the previous one since
{bin laden} has a high influential topic probability and dominates the topic
similarity measure according to Eq. (16). We can also see a certain degree of
evolution between these two topic clusters, since the second cluster contains
a new topic {force troop Afghanistan} that is about the U.S. sending troops
to start the war. In the following windows, the topic clusters evolve, but
all of them contain the influential topic {bin laden}. This event ends in
window 30 with a topic {civilian death}, indicating that the war resulted in
civilian deaths in the country. From this priming event, we can see that the
attack united the U.S. people in support of their President. Similarly, the
volatility index also reflects the rebound of the approval index in 2003. The
priming event covering that period is about the Iraq war, ending with the
U.S. victory, which increased the public support of President Bush. Other
periods with significantly increasing volatility, such as those in Mar 2001
and Jul 2004, can also be explained by the detected priming events, i.e.,
President Bush released the energy plan and ran the presidential campaign
against John Kerry.

Figure 3: Top 10 Priming Events of President Bush

Figure 4: Priming Event: 911 Terrorist Attack

Figure 5: 8 Priming Events of President Obama (2009.1-2010.6)

Figure 5 shows the approval index of President Obama and the 8 priming
events detected over his 16 months of administration. As discussed in
Section 1, users may be curious about what happened during the quick drop in
July 2009. In the lower part of Figure 5, when the volatility increased, we
detect the priming event in which he criticized the police who arrested
Prof. Gates. On the other hand, the algorithm also detects that his effort to
pass the legislation on unemployment in April 2010 helped to improve his
approval rate.



Figure 6: Top 10 Priming Events of Financial Tsunami (2007.6-2010.6)

6.3 Priming Events in Finance

Figure 6 shows the S&P 500 index and the top 10 priming events detected
over the three years of the financial crisis.

In the upper part of Figure 6, we can see that the S&P 500 index reached its
peak of 1562 in Oct. 2007 and then started to turn down. At the same time, in
the lower part, the volatility increased and we detect the event in which the
subprime mortgage crisis started to emerge. As the crisis developed, the
biggest index drop, 300 points, occurred in Oct. 2008, before which we
detected the bankruptcy event of Lehman Brothers. The left part of Figure 7
shows its structure. In its three evolving windows, the influential topic
{Lehman Brother} always exists. In the first window, the event mentions the
possibility that the bank would fall as Bear Stearns did. In the following
two windows, it integrates the bankruptcy topic as well as a topic about the
reaction from Europe. After this event, the index kept dropping and reached
its bottom of 683 in March 2009. It started to rebound with the detected
event in which President Obama released the stimulus plan to rebuild the
confidence of investors. Recently, we also observed two priming events about
the European Fiscal Deficit Crisis. The right part of Figure 7 shows its
structure, which has a length of 2. In the first window, we detected the
event of the European Fiscal Deficit, and in the second window we integrate a
new topic cluster {Germany}, indicating that Germany played a major role in
resolving the crisis.

Figure 7: Priming Event: Financial Crisis

Figure 8: Event Quality Comparison

6.4 Event Quality Comparison

In order to further evaluate the performance, we implement a baseline
approach for comparison, which is based directly on bursty events [6].
According to this approach, the bursty period of the event is decided by:

P(w,E k ) = T }-r PUM (18) 

where $E_k$ is similar to the influential topic identified in this paper,
but it does not integrate the index volatility into the detection process.
Given the events detected by the baseline approach and by the proposed
approach, we compare their average priming event scores. Figure 8 shows the
result: our proposed approach outperforms the baseline approach on all three
datasets. We also note that for the Bush and Obama datasets, the baseline
approach shows a negative value, which means that the detected events do not
match the volatility index well and therefore many of them have a negative
priming event score. Besides, the average score of the events on the Tsunami
dataset is much larger than the average scores on the Bush and Obama
datasets. This is because the volatility of the financial index is much
larger than the volatility of the President approval index, which leads to a
larger priming score according to Eq. (13).

7 Related Work

The problem of Topic Detection and Tracking (TDT) [1] has been a classical
research problem for many years. The first stream of work used graphical
probabilistic models to capture the generation process of the document
stream [2, 7]. [20] extended the LDA model and incorporated location
information. [21] analyzed multiple coordinated text streams and detected
correlated bursty topics. [15] added social network information as a
regularization into the topic detection framework. On the other hand, another
stream of work detected topics based on bursty features [11]. [6] detected
bursty features and clustered them into bursty events. [I] further built an
event hierarchy based on the bursty features. [10] analyzed the
characteristics of bursty features (power and periodicity) and detected
various types of events based on them. [13] proposed an algorithm to track
short phrases and organized them into different news threads. Although these
works can detect topics and track events efficiently, they cannot tell the
users which topics/events are changing the real-world time series of
interest, e.g., the President Approval Rating or the Stock Market Index.

In contrast, some work studies the relationship between the text stream and
a time series of interest. [12] studied the relationship between various
topics and the volatility of the presidential approval index. [0J attempted
to find out the relationship between online queries and the related sales
rank. [5] proposed a model for mining the impact of news stories on stock
prices, using a t-test based split and merge segmentation algorithm for time
series preprocessing and SVM for impact classification. [19] studied the
relationship between news arrival and the change of volatility of the stock
market index. [18] further explored this relationship to rank the risk of
stocks. However, these works either rely on analysts to extract topics
manually or just identify a list of influential keywords. In our work, we
attempt to identify these events automatically by first organizing these
words into high-level influential topics and then detecting priming events
based on them.

8 Conclusion 

In this paper, we study the problem of detecting priming events based on a
time series index and an evolving document stream. We measure the effect of
the priming events by the volatility rate of the time series index. We
propose a two-step framework to detect the priming events by first detecting
influential topics at a global level and then forming priming events using
the detected influential topics at a micro level. The experimental results on
the political and finance domains show that our algorithm is able to detect
the priming events that trigger the movement of the time series index
effectively.

Algorithm 2 ProbeEventPath

Input: Topic Cluster C, Topic Cluster Path P, Seed Inverted Path List SIP,
       seed window sw, Sign
Output: updated Topic Cluster Path P

 1: w = sw and IP = SIP   // right side
 2: while IP is not empty do
 3:   for each c_{w+1,j} ∈ C_{w+1} with Sign_{w+1,j} = true do
 4:     for every c_{w,i} that has an entry in IP do
 5:       compute topic similarity sim(c_{w,i}, c_{w+1,j}) according to Eq. (15)
 6:       if sim(c_{w,i}, c_{w+1,j}) > α then
 7:         for each path P_k in the inverted path list IP_{c_{w,i}} do
 8:           compute score(P_k, c_{w+1,j}) according to Eq. (16)
 9:           if score(P_k, c_{w+1,j}) > P_k.score then
10:             P_k.score = score(P_k, c_{w+1,j})
11:             add c_{w+1,j} to P_k
12:             add P_k to the new inverted path list of c_{w+1,j}, NIP_{c_{w+1,j}}
13:             Sign_{w+1,j} = false
14:           end if
15:         end for
16:       end if
17:     end for
18:   end for
19:   IP = NIP
20:   w = w + 1
21: end while
22: w = sw and IP = SIP   // left side
23: ...